Are you a broke college student? Have you ever wondered when the heck flight prices drop, and if there is such thing as an optimal time to splurge on tickets to see your long distance partner? This started off as an obsessive screenshotting extravaganza that involved opening Google Flights, searching up the desired flight, and copy + pasting the price history graphs of the exact same flight path on the same day of the week but also for the weeks before and after the desired date. That information still resides in my initial mess of a Google Docs file, but I’ve decided that it would be a shame of me as a data scientist and MS Candidate to keep eyeballing everything instead of just making my own graphs in R, so now I’ve regretfully taken it to the next level by fiddling with R code and making this a more official-looking data science project.
I’ve been tracking flight prices from Newark Liberty International Airport (EWR) to Minneapolis-St. Paul International Airport (MSP) for the following Mondays: March 18, March 25, April 8, and April 15, 2024. Specifically, I’ve been logging (no, not the math definition) the lowest price for a one-way nonstop ticket for each day. Most of my data come from Google Flights, though I’ve noticed a few discrepencies in their data: a few days ago, the website highlighted Sun Country Airlines for having the alleged lowest price (on another flight I was looking at, but not any of the ones in this analysis), but they actually had already increased that original price from the previous day, so Google was wrong(!!). [3/5/24 Update: There also are a few other discrepencies between the Google Flight history and my own records near the end of Feburary, aka when I was able to look up the flights on United.com myself and just screenshot the fare calendar and record the numbers in my CSV files. However, I expect most of the past data to be accurate.]
The prices I’ve been tracking on the four departure days are each contained in their own CSV file. My desired flight is the one on April 8. I skipped April 1, since those prices seemed a lot higher overall, likely due to Easter weekend. Though I tracked April 15’s prices since it is the following week’s Monday, I didn’t end up using it, since Sun Country’s flight would have been the cheapest one, and they don’t have any flights for the other three Mondays in the dataset (whose flights are mainly operated by Delta and United Airlines). Sun Country is known for being a budget airline that offers flights to and from MSP, so I omitted April 15 from my analysis. (It’s still loaded in the data, though.)
Oh, right! Where’s our data documentation? Here it is:
| Column Name | type | Description |
|---|---|---|
| flight_date | chr |
one of March 18, March 25, April 8, or April 15 |
| days_before | int |
number of days before the flight date |
| lowest_price | int |
lowest price (USD) for a one-way nonstop ticket (contingent upon Google’s psychic powers that day) |
| day | chr |
day of the week (here, it’s just Mondays) |
| origin | chr |
origin airport (all EWR, nothing to see) |
| destination | chr |
destination airport (all MSP, nothing to see) |
| sun_country | chr |
whether Sun Country Airlines is the cheapest option for
that day (yes or no) |
| date_recorded | Date |
date the price was “observed” |
# initialize data frame
d = NULL
# read in four files, one for each Monday tracked
dates <- c("03-18", "03-25", "04-08", "04-15")
filenames <- paste0(dates, "-2024_ewr_msp.csv")
for (f in filenames) {
d <- rbind(d, read.csv(f))
}
# convert date_recorded to date object for plotting purposes
d$date_recorded <- as.Date(d$date_recorded, "%m-%d-%Y")
# discard rows where sun_country is yes, and add label for plotly hover text
d <- d %>%
filter(sun_country == 'no') %>%
mutate(txtlabel = paste0('Recorded: ', date_recorded,
'\nDays before: ', days_before,
'\n<b>Lowest price: $', lowest_price,'</b>'))
d %>%
select(flight_date, days_before, lowest_price, date_recorded) %>%
head()
## flight_date days_before lowest_price date_recorded
## 1 03-18-2024 85 224 2023-12-24
## 2 03-18-2024 84 224 2023-12-25
## 3 03-18-2024 83 224 2023-12-26
## 4 03-18-2024 82 224 2023-12-27
## 5 03-18-2024 81 224 2023-12-28
## 6 03-18-2024 80 224 2023-12-29
For the record, I didn’t actually start recording dates until four
days ago (February 19), but Google Flights price history can go back as
far as 60 days. Let’s just say that date_recorded
corresponds to the day you would have known what the cheapest price was,
even though we acknowledge having this retrospective power in our hands.
This column will be useful for one of my plots below… stay tuned!
I’ll also make a subset of the data that only contains the observations that aren’t the same as the preceding or following day’s lowest price, which will be useful for adding markers to our line plots.
# thanks to
# https://stackoverflow.com/questions/37610056/how-to-treat-nas-like-values-when-comparing-elementwise-in-r
# for the `%!=na%` function, to compare numbers to NA's
`%!=na%` <- function(e1, e2) {
e1 != e2 | (is.na(e1) & !is.na(e2)) | (is.na(e2) & !is.na(e1)) & !(is.na(e1) & is.na(e2))
}
# discard "middle" prices
nomids <- d %>%
filter(date_recorded <= '2024-02-23') %>%
group_by(flight_date) %>%
mutate(prev = lag(lowest_price), follow = lead(lowest_price)) %>%
mutate(keep = (prev %!=na% lowest_price | follow %!=na% lowest_price)) %>%
filter(keep) %>%
ungroup()
Oof, this is a lot of code. I’m going to try to make this as
interactive as possible, so I’m using the plotly package
that enables a text bubble to appear when you hover your cursor over the
lines. Here, I plot the lowest price for each day against the number of
days before the departure date that it was recorded. I’m also going to
include a few purple horizontal lines ($120, $140, and $160) to indicate
the price range that I’m willing to pay for a one-way ticket.
Hovering over the lines/markers will show how many days before departure date and the cheapest price before recorded.
As of February 23, we are about 6.5 weeks before departure. The last substantial drop in prices for the other two flights seems to happen around 5.5 weeks… so I’m hoping to see a similar trend for the April 8th flight.
I wanted to also make a plot of prices against the date they were observed.
Weirdly enough, it looks like March 25 and April 8 have been following the same exact price trend for the last seven days, even though these flights are two weeks apart. What does this forbode for next week’s prices?
Here is the same information using a different type of interactive feature: a vertical line that follows the cursor. This is useful for comparing the prices that were recorded on the same day for each of the flights.
mar18 <- d[d$flight_date == '03-18-2024' & !is.na(d$lowest_price) & d$date_recorded <= '2024-02-23',
'lowest_price']
mar25 <- d[d$flight_date == '03-25-2024' & !is.na(d$lowest_price) & d$date_recorded <= '2024-02-23',
'lowest_price']
apr08 <- d[d$flight_date == '04-08-2024' & !is.na(d$lowest_price) & d$date_recorded <= '2024-02-23',
'lowest_price']
dates <- d[d$flight_date == '04-08-2024' & !is.na(d$lowest_price) & d$date_recorded <= '2024-02-23',
'date_recorded']
data <- data.frame(dates, mar18, mar25, apr08)
fig <- plot_ly(data, x = ~dates, y = ~mar18, name = 'March 18', type = 'scatter', mode = 'lines')
fig <- fig %>% add_trace(y = ~mar25, name = 'March 25', mode = 'lines+markers')
fig <- fig %>% add_trace(y = ~apr08, name = 'April 08', mode = 'markers')
fig <- fig %>%
layout(hovermode = "x unified",
hoverlabel = list(align = "right"),
legend = list(title = list(text = "Flight Date")),
title = list(text = "Flight Prices vs. Date Recorded",
x = 0.5, y = 0.98),
xaxis = list(title = "Date Recorded"),
yaxis = list(title = "Lowest Price",
tickprefix = "$",
range = list(110, 230)),
height = 400
)
div(fig, align = 'center')
As of Feburary 23, 45 days (or 6.5 weeks) before April 8, I am hoping that the cheapest price will make one final drop sometime next week. I hope this won’t come back to bite me in the butt, but I shall wait until then to buy the ticket.
11 days later… I have reappeared from the shadows to report that the prices have indeed dropped! Well, price (singular) would be more accurate, since the cheapest nonstop price on March 18 and March 25 jumped by about $80 today, while that of April 8 dropped by $80.
It’s been a tense week and a half waiting on April 8’s price to drop, especially on February 25, when all other prices dropped to a reasonable price EXCEPT for April 8. I was so puzzled as to why it didn’t drop, and I wondered if I too blindly trusted this extremely tiny dataset of two preceding Mondays’ three months worth of price history… but I’m glad I waited, because it not only dropped but dropped to a price I’d be willing to pay.
In the following plot, I chopped off the top of the y-axis to make the numbers easier to read, but for the record, after Feburary 26, the price for March 18 jumped to $305, and today (March 5), the price for March 25 jumped to $305:
I was really expecting a price drop for April 8 about 5.5 weeks before the departure date, but it ended up happening under 5 weeks from the departure date, as seen in the updated plot below (green). It’s not good for the heart to rely on price drops within 5 weeks or less before the departure date, but it seemed to work out this time. (Phew!) Below, the y-axis is chopped off once again:
Lady Luck seemed to finally shine on me today. Or could this have been foreseen earlier? Whichever the case, I’m happy.
…to the following forum posts for teaching me lots of things about
plotly and ggplot2 that I didn’t know
before: